Intelligent surgical workflow recognition for endoscopic submucosal dissection with real-time animal study

Recent advances in artificial intelligence have achieved human-level performance; however, AI-enabled cognitive assistance for therapeutic procedures has been neither fully explored nor pre-clinically validated. Here we propose AI-Endo, an intelligent surgical workflow recognition suite for endoscopic submucosal dissection (ESD). AI-Endo is trained on high-quality ESD cases from an expert endoscopist, spanning a decade and consisting of 201,026 labeled frames. The learned model demonstrates outstanding performance on validation data, including cases from relatively junior endoscopists with various skill levels, procedures conducted with different endoscopy systems and therapeutic skills, and cohorts from multiple international centers. Furthermore, we integrate AI-Endo with the Olympus endoscopic system and validate the AI-enabled cognitive assistance system with animal studies in live ESD training sessions. Dedicated data analysis from surgical phase recognition results is summarized in an automatically generated report for skill assessment.


2). Three-step annotation schedule
We adopted a three-step annotation schedule to annotate the developmental and external datasets.
The entire annotation process is as follows:
(1) To ensure the quality of the annotation process, we first randomly selected approximately 10% (5 cases, 20,446 frames) of the developmental dataset and assigned two annotators to annotate the data independently, using the dataflow outlined in Supplementary Figure 1 and following predefined protocols. This step served as a quality control measure for the annotations and allowed the annotators to become familiar with the annotation workflow. To quantitatively evaluate the interobserver agreement between the two raters, we used the Pearson correlation coefficient (PCC) [1]. The resulting PCC for the selected samples was 0.93, indicating a high degree of agreement between raters. To visually examine disagreements between the annotations, we display phase annotation examples from the two annotators in Supplementary Figure 3. For the majority of the video, the annotations of the two raters were nearly identical, with only a small percentage (3.30%) exhibiting ambiguity between the annotations of different raters.
(2) Considering the high level of consistency in the initial annotation results of the two annotators, we divided all annotation tasks into roughly equal halves, with each rater annotating one part independently. For instance, we divided the 47 cases in the expert dataset into two non-overlapping parts of 24 and 23 cases, respectively. The two annotators then annotated one part each, resulting in 108,286 frames for the first part and 92,740 frames for the second part. Given the high level of agreement between the annotators, we treated the two parts of annotations as the final annotation for the expert dataset. We employed the same strategy to distribute the datasets to raters for all external datasets.
(3) Following the completion of all annotation tasks by the two annotators, we subjected the annotations to quality control by two experienced endoscopists with six and three years of experience. Their focus was on correcting challenging annotations that even trained annotators may struggle with. Both endoscopists relied on visual cues and practical surgical experience to determine the appropriate surgical phase in complex surgical sites or when key anatomical landmarks were obscured. To facilitate the verification process, we provided synchronized information by overlaying the annotation results from the second step onto the raw video. The endoscopists reviewed the video and corrected any errors as needed. In the case of any discrepancies, the two endoscopists discussed the frames in question to reach a consensus label, giving more weight to the opinion of the more senior endoscopist, who had conducted more ESD cases. This step yielded the final phase annotation dataset.
We have visualized the steps involved in calculating the orderliness metric of phase Dissection in Supplementary Figure 4. In Supplementary Figure 4a, we obtained the optimized threshold t* from the ROC, and the inset illustrates the specific process we used to calculate orderliness, as described above. Additionally, we have depicted the relationship between the threshold t* and the probability distributions of Dissection and non-Dissection frames in Supplementary Figure 4b.
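As an illustration of the interobserver agreement check in step (1), the following minimal sketch computes a Pearson correlation coefficient and a frame-level disagreement rate between two raters. It assumes that each frame's phase is encoded as an integer code; the source does not state the encoding, so the codes, the function names `pearson_correlation` and `disagreement_rate`, and the toy label sequences are all hypothetical. Note that a PCC computed on categorical codes depends on the chosen encoding.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def disagreement_rate(x, y):
    """Fraction of frames on which the two raters assign different labels."""
    if len(x) != len(y) or not x:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(a != b for a, b in zip(x, y)) / len(x)


# Hypothetical per-frame phase codes from two raters (illustrative only,
# not data from the study).
rater_a = [0, 0, 1, 1, 2, 2, 2, 3, 3, 0]
rater_b = [0, 0, 1, 1, 2, 2, 3, 3, 3, 0]

pcc = pearson_correlation(rater_a, rater_b)
rate = disagreement_rate(rater_a, rater_b)
print(f"PCC = {pcc:.3f}, disagreement = {rate:.2%}")
```

On real data, the two sequences would be the frame-aligned labels produced by each annotator for the same videos.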
Ideally, t* would perfectly separate these two distributions, meaning that the output probability of every Dissection frame would be higher than that of every non-Dissection frame. However, the AI model may fail to recognize challenging scenarios, such as a blurry view or obscured surgical tools. This can result in prediction errors, i.e., FP and FN, consequently dividing all samples into four groups, as shown in the figure. To assess the performance of the AI model with respect to each phase, we define the orderliness as the proportion of frames that are correctly classified, i.e., TP and TN.

Supplementary Figure 1: The workflow of phase annotation. Blue lines represent the stages in which we prepared data for annotation, and yellow lines denote the annotation process. The annotation examples of start and end frames of phases Marking, Injection, and Dissection at 1 fps.
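The orderliness calculation described above can be sketched as follows. The text states only that t* is obtained from the ROC, so this sketch assumes Youden's J statistic (maximizing TPR minus FPR) as the selection rule; that choice, the function names `youden_threshold` and `orderliness`, and the toy probabilities are assumptions made for illustration. Orderliness is taken here as (TP + TN) divided by the total number of frames, following the definition in the text.

```python
def youden_threshold(probs, labels):
    """Pick the threshold that maximizes TPR - FPR (Youden's J).

    Assumption: the source does not specify how t* is optimized on the ROC.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both Dissection and non-Dissection frames")
    best_t, best_j = None, -1.0
    for t in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= t)
        fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= t)
        j = tp / positives - fp / negatives
        if j > best_j:  # on ties, the lowest threshold is kept
            best_t, best_j = t, j
    return best_t


def orderliness(probs, labels, threshold):
    """Proportion of frames correctly classified at the given threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= threshold)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < threshold)
    return (tp + tn) / len(labels)


# Hypothetical model outputs: probability that a frame belongs to Dissection,
# with label 1 for Dissection frames and 0 for non-Dissection frames.
probs = [0.95, 0.80, 0.70, 0.40, 0.60, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

t_star = youden_threshold(probs, labels)
print(f"t* = {t_star}, orderliness = {orderliness(probs, labels, t_star)}")
```

In this toy example one non-Dissection frame scores above the threshold and is counted as an FP, so the orderliness falls below 1; the same procedure would be repeated per phase, treating that phase as the positive class.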

Supplementary Figure 4: Definition of the Orderliness metric. a The steps for calculating Orderliness; b The distributions of output probabilities corresponding to Dissection (in red) and non-Dissection (in purple) frames.
The curves of the training loss of (a) the ResNet50 module in the 1st stage and (b) the Fusion and Transformer modules in the 2nd stage. The dashed line indicates the number of iterations we actually trained with.

Table 2: Performance metrics on the external dataset from different surgeons and skills. Source data are provided as a Source Data file.

Table 3: Performance metrics on the ex-vivo animal trial dataset. Source data are provided as a Source Data file.

Table 4: Performance metrics on the in-vivo animal trial dataset. Source data are provided as a Source Data file.